08 July, 2019
order of included R functions:

- lm()
- summary()
- ggplot() + stat_qq()
- ggplot() + geom_smooth()
- tidy(), augment(), glance()
- summ(), plot_summs()
- autoplot()
- glm() (different families)

In a linear regression, we aim to find a model:

\(y = \beta_0 + \beta_1 \cdot x + \varepsilon\)
The command in R for a linear model is
lm(y~x).
y is the outcome variable that interests us, and x is the variable whose association with y we want to test.
Let’s first have a look at the summary table of the example data set, by using the summary() command:
##       day1             day2               day3      
##  Min.   :-1.611   Min.   :  0.2396   Min.   : 0.00  
##  1st Qu.: 3.031   1st Qu.:  9.9763   1st Qu.:13.25  
##  Median : 5.619   Median : 15.9973   Median :19.37  
##  Mean   : 5.428   Mean   : 16.7252   Mean   :20.27  
##  3rd Qu.: 7.772   3rd Qu.: 20.6984   3rd Qu.:28.05  
##  Max.   :12.142   Max.   :143.8187   Max.   :39.03  
##                   NA's   :4          NA's   :16     
Before we start with the linear regression model, we need to get an idea of the underlying data and its distribution. We know that linear regression makes the following assumptions: linearity of the relationship, independence of the observations, homoscedasticity (constant variance of the residuals), and normality of the residuals.
tki_demo %>% 
  filter(day2 < 100) %>% 
  gather(Days, measurement, day1:day3, factor_key = TRUE) %>% 
  ggplot(aes(sample = measurement, color = Days)) + 
  stat_qq() + 
  facet_wrap(~Days)
lm() function

In the plots, we could already see that petal length and petal width seem to be associated. This becomes obvious when drawing a line into the plot. Let's now fit a linear regression model in R.
lm(Petal.Length~Petal.Width, data=iris)
As said before, the first argument in the code is y, our outcome variable or dependent variable. In this case it is Petal.Length.
The second argument is x, the independent variable. In our case: Petal.Width.
We also specify the data set that holds the variables we specified as x and y.
Now we want to look at the results of the linear regression. So how do we get the p-value and \(\beta\)-coefficient for the association?
In R, we add the summary() function to the lm() function, like so:
summary(lm(y~x, data=data))
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16
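If you only need single numbers rather than the printed table, the coefficient matrix can also be pulled straight out of the summary object in base R. A small sketch using the iris model from above (the object name `fit` is just illustrative):

```r
# Fit the same model as above and extract the coefficient table,
# a matrix with columns Estimate, Std. Error, t value and Pr(>|t|).
fit   <- lm(Petal.Length ~ Petal.Width, data = iris)
coefs <- summary(fit)$coefficients

beta <- coefs["Petal.Width", "Estimate"]   # the beta-coefficient (slope)
pval <- coefs["Petal.Width", "Pr(>|t|)"]   # its p-value
round(beta, 2)                             # 2.23, matching the printout above
```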
jtools and broom: lm() results

The output above contains a lot of relevant information, but it is not straightforward to access the individual parameters such as p-values and betas. The broom package is in line with the "tidy" data handling in R and turns the linear model results into an easily accessible tibble format:
library(broom)
lm1 <- lm(Petal.Length ~ Petal.Width, data = iris)
tidy(lm1)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     1.08    0.0730      14.8 4.04e-31
## 2 Petal.Width     2.23    0.0514      43.4 4.68e-86
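Besides tidy(), the function list at the top also mentions glance() and augment() from broom. As a quick sketch (re-fitting the iris model so the snippet is self-contained): glance() returns a one-row tibble of model-level statistics, and augment() attaches fitted values and residuals to the original data:

```r
library(broom)

lm1 <- lm(Petal.Length ~ Petal.Width, data = iris)

glance(lm1)          # one row: r.squared, statistic, p.value, AIC, ...
head(augment(lm1))   # original data plus .fitted, .resid and other diagnostics
```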
The broom package helps with the accessibility of the output, but the style of the output is not very appealing for a publication or a report. The jtools package helps with this and has other nice functionalities, such as forest plots for coefficients and confidence intervals:
library(jtools)
export_summs(lm1)
|             | Model 1  |
|-------------|----------|
| (Intercept) | 1.08 *** |
|             | (0.07)   |
| Petal.Width | 2.23 *** |
|             | (0.05)   |
| N           | 150      |
| R2          | 0.93     |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
After running the linear regression, we need to test the quality of the model. First, we look at the R2 statistic. The R2 statistic measures how much of the variance in the data is explained by our model. It ranges from 0 to 1, where 1 means the model explains 100% of the variance, i.e. fits the data perfectly.
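The R2 statistic can also be read out of the model directly; summary() stores both the plain and the adjusted version. A sketch with the iris model:

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
s   <- summary(fit)

s$r.squared       # about 0.927: ~93% of the variance is explained
s$adj.r.squared   # the same, penalised for the number of predictors
```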
library(ggfortify)
lm1 <- lm(Petal.Length ~ Petal.Width, data = iris)
autoplot(lm1)
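If ggfortify is not available, base R's plot() method on a fitted lm object draws the same four diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage). A sketch with the iris model:

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in one grid
plot(fit)
```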
Multiple linear regression works with the same function in R:

lm(y ~ x + covar1 + covar2 + … + covarx, data = data)

The standard R output is again rather messy for reports, but it helps with a first visual inspection in the console:
summary(lm(day3 ~ day2 + male + intervention, data=tki_demo))
## 
## Call:
## lm(formula = day3 ~ day2 + male + intervention, data = tki_demo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.227  -1.813  -0.192   1.687   9.279 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         19.28189    1.18213  16.311  < 2e-16 ***
## day2                -0.01281    0.03689  -0.347    0.729    
## maleTRUE             1.48829    1.20570   1.234    0.221    
## interventionDrug 2   9.26868    1.33198   6.959 1.11e-09 ***
## interventionPlacebo -9.26457    1.40139  -6.611 4.93e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.771 on 75 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.726,  Adjusted R-squared:  0.7113 
## F-statistic: 49.67 on 4 and 75 DF,  p-value: < 2.2e-16
With the demo data set:
export_summs(lm(day3 ~ day2 + male + smoker, data = tki_demo))
|             | Model 1   |
|-------------|-----------|
| (Intercept) | 15.81 *** |
|             | (1.46)    |
| day2        | 0.16 **   |
|             | (0.06)    |
| maleTRUE    | 7.11 ***  |
|             | (1.89)    |
| smokerTRUE  | -3.67     |
|             | (2.03)    |
| N           | 80        |
| R2          | 0.24      |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
With the demo data set:
lm1 <- lm(day3 ~ day2, data = tki_demo)
lm2 <- lm(day3 ~ day2 + male + smoker, data = tki_demo)
export_summs(lm1, lm2)
|             | Model 1   | Model 2   |
|-------------|-----------|-----------|
| (Intercept) | 17.25 *** | 15.81 *** |
|             | (1.41)    | (1.46)    |
| day2        | 0.16 **   | 0.16 **   |
|             | (0.06)    | (0.06)    |
| maleTRUE    |           | 7.11 ***  |
|             |           | (1.89)    |
| smokerTRUE  |           | -3.67     |
|             |           | (2.03)    |
| N           | 80        | 80        |
| R2          | 0.09      | 0.24      |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | |
The jtools package has a nice function to plot coefficients and their confidence intervals very easily, utilising ggplot2: plot_summs()
plot_summs(lm1)
plot_summs(lm1, lm2)
plot_summs(lm1, lm2, coefs="day2")
With jtools we can access more information from the model in a single, easy step. Here, we access the confidence intervals and the variance inflation factors (for multicollinearity testing), but leave out the p-values:
summ(lm2, scale = TRUE, vifs = TRUE, part.corr = TRUE, confint = TRUE, pvals = FALSE)$coeftable
##                   Est.       2.5%      97.5%    t val.      VIF  partial.r
## (Intercept)  18.569528 16.2225902 20.9164657 15.758586       NA         NA
## day2          2.568780  0.7826129  4.3549470  2.864328 1.021056  0.3121443
## maleTRUE      7.107081  3.3480942 10.8660688  3.765636 1.041817  0.3965366
## smokerTRUE   -3.665084 -7.7050377  0.3748704 -1.806864 1.054610 -0.2029483
##                  part.r
## (Intercept)          NA
## day2          0.2862919
## maleTRUE      0.3763784
## smokerTRUE   -0.1805975